Step 1: Ask

Business Task

This analysis was initiated by Urška Sršen, co-founder of Bellabeat. Sršen knows that an analysis of available data on consumers’ smart device usage would reveal opportunities for the company to grow, and has provided the following business task:

Analyse smart device usage data to gain insight into how people are already using smart devices, then generate high-level recommendations for how these insights can inform the marketing strategy for one Bellabeat product.

Key Stakeholders

  • Urška Sršen
    • Co-founder of Bellabeat
    • Initiator of this analysis
  • Bellabeat marketing team
    • Intended audience for my presentation
    • Will use my insights to guide marketing strategies

Step 2: Prepare

Setting up my tools

Selecting my tools

For this analysis I wanted a tool or set of tools with the following features:

  • Sufficient power enough to handle large data sets, e.g. FitBit data tables with >1M observations
  • Functions for data manipulation, e.g. loading, cleaning and combining data sets
  • Functions for data analysis, e.g. regression analysis and statistical analysis
  • Functions for data visualisation, e.g. plotting
  • Methods for storing the details of my analysis methodology separate from the data itself, e.g. separate source code files, as opposed to macros stored within spreadsheet files
  • Methods for generating reports from my analysis with as little repetition of work as possible, e.g. inline markdown languages, or Page Layout views in spreadsheet applications
  • A straightforward learning process and user interface, e.g. spreadsheet tools typically have a single “Add Chart” tool under which all of their powerful charting options can be found; contrast this with learning to download, enable, and finally use the ggplot2 R library

With these requirements in mind, I considered three tools: R, spreadsheets, and databases:

Feature R Spreadsheets Databases
Power for large data sets Yes No Yes
Data manipulation tools Yes Yes Yes
Data analysis tools Yes Yes No
Data visualisation tools Yes Yes No
Separate analysis files Yes No Yes
Streamlined report generation Yes No No
Straightforward to learn No Yes No

Clearly, R is the best single tool for this analysis. R lacks the straightfoward operation of spreadsheet tools, and I will need to learn libraries and programming techniques as I do my analysis, but this is acceptable given my prior experience with other programming languages like Python.

Note: Python itself was not considered for this analysis: while it shares most of the features, advantages, and disadvantages of R, I’m already familiar with Python and wanted to use this case study to familiarise myself with R instead

Setting up my RStudio environment

The first preparation stage involves setting up my RStudio environment for my analysis.

The R chunk below automatically loads all packages included in the “rqd_pkgs” list, installing them first if required: this ensures all packages can be loaded by other analysts replicating my work, and minimises the effort required to modify the package list.

Getting the data

For this data analysis, I’ll be making use of one public data set specified by Sršen, plus additional data sets as required to address any limitations found in that dataset.

The data set specified by Sršen is the FitBit Fitness Tracker Data Set. This is a public data set available under the CC0 License via Kaggle user Mobius.

For the initial analysis, the data set was downloaded in its entirety from Kaggle and stored locally on my computer. This provided a baseline for analysis, with any modifications to file names, folder re-structuring, or removal of unnecessary data tables to be conducted after I’m familiar with the raw data.

Loading the data

To get a top-level view of the data, I first use the readr package to load all of the required data sets directly into the my R environment.

csv_dir <- "Fitabase_Data_Cleaned"
paths_dfs <- list.files(csv_dir, pattern = "*.csv", full.names = TRUE)

df_names <- paths_dfs %>%
  basename() %>%
  tools::file_path_sans_ext() 

for (i in 1:length(df_names)) {
  assign(df_names[i], read_csv(paths_dfs[i]))
  cat(df_names,"\n")
}
rm(i, paths_dfs)

Understanding the data

Overview

The database contains data on the following features of the FitBit devices:

Feature Units Sampling Rate
BMI BMI Manual/automatic logging
Body Weight kg Manual/automatic logging
Body Fat % Manual/automatic logging
Calorie burn Calories 1 minute
Distance Unknown 1 minute
Heart Rate Beats per Minute (BPM) 5 seconds
Intensity (of exercise) Factor, 0 to 3 1 minute
METs (during exercise) METs 1 minute
Sleep Factor, 1 to 3 1 minute
Steps Steps 1 minute

Some details of the data gathering are not yet clear to me:

  • The real-world meanings of the “Intensity” factor variable levels; presumably 3 is maximum intensity and 0 is inactivity.
  • The real-world meanings of the “Sleep” factor variable levels are; presumably it indicates “quality of sleep”, with 3 being best.
  • What triggers the non-manual reports; presumably it’s logged from another device, like a body composition analyser.

Inconsistent and misleading file names

The file names in the data set present the following issues:

  • Unclear if data is source or summarized, e.g. “minuteCaloriesNarrow_merged.csv” is source data, while “dailyCalories_merged.csv” is summarized data
  • Summarized data not always tied to specific features, e.g. “dailyActivities_merged.csv” contains data summarized from multiple features: there is no “Activity” feature with an associated source data set.
  • Data shape unclear, e.g. “minuteCaloriesNarrow_merged.csv” and “dailyCalories_merged.csv” are both tall data, while “minuteIntensitiesWide_merged.csv” and “dailyIntensities_merged.csv” are both wide data
  • Data sampling rate unclear, e.g. “minuteSleep_merged.csv” compared to “daySleep_merged.csv”
  • Feature is not first in the file name, affecting file sort operations, e.g. “minuteSleep_merged.csv” is sorted closer to “minuteStepsNarrow_merged.csv” than to the related “daySleep_merged.csv”
  • All files are suffixed with “_merged”, so no distinction is made by this information and it could be dropped

To correct each of these issues, I’ll rename the files using this naming convention:

[feature]_[src/sum]_[interval]_[shape].[filetype]

For example, “sleepDay_merged.csv” will be renamed to “sleep_sum_days_wide.csv”.

Inconsistent variable names

All variables in the data set are named in CapitalisedCase, whereas I would typically use snake_case by convention. This is a relatively minor issue that I could go without correcting, however a small number of variable names, like logId, are also inconsistently capitalised across different tables. Since I’ll be adjusting some names anyway, and there’s presumably a library function to change this with minimal effort, I’ll put it on the data-cleaning to-do list.

Inappropriate variable types

Some of the tables in the data set use variables of a type unsuitable for analysis. These variables and their required modifications are tabulated below:

Variable Original Type Updated Type Reason
ActivityDay chr datetime Cannot perform datetime operations on chr variables
ActivityHour chr datetime Cannot perform datetime operations on chr variables
ActivityMinute chr datetime Cannot perform datetime operations on chr variables
Date chr datetime Cannot perform datetime operations on chr variables
Id num chr Disable scientific notation and numerical operations (IDs are not a numeric value)
LogId num chr Disable scientific notation and numerical operations (IDs are not a numeric value)
SleepDay chr datetime Cannot perform datetime operations on chr variables
Time chr datetime Cannot perform datetime operations on chr variables

Missing context for numeric variables

All of the numeric variables in the data set have clearly defined units except for “Distance”. Distance does not appear to have a source data table: it’s only included in the “activity_sum_wide_days” and “intensity_sum_wide_days” tables, and only as summary data grouped via Intensity level. Floating-point values between 0 and 1 are present, so it seems reasonable to assume this is either kilometers or miles, as opposed to meters or feet. No geographical information is given in the data set, so I can’t assume the participants are from the U.S., where miles would be appropriate. Given miles and kilometers represent the same information on slightly different scales, any insights about different use cases between users should still be apparent, therefore I think it’s reasonable to assume the distances are given in kilometers for this analysis.

Missing context for factor variables

Two variables in the data set, exercise “Intensity” and sleep “Value”, are numerical factors with no defined range. For these factors, I can’t tell from the data alone whether all possible values that a FitBit can record are present. The values for exercise intensity, for example, range from 0 to 3; it could be the case that the FitBits used only generate four levels of intensity, but it could just as easily be the case that the values go up to 100 (i.e. a percentage). This has clear implications for our analysis: if 3 is the max, records of 3 indicate users wear their FitBits while exercising as hard as they can, whereas if 100 is the max, records of 3 indicate users wear their FitBits while sitting on the couch as hard as they can.

With this in mind, I’m going to invest a bit of time to confirm the meaning of these variables before attempting to analyse them (read: about ten hours to investigate everything and learn how to do it all in R and learn how to make it look suitably pretty and coherent in RMarkdown).

Validating “Exercise Intensity” by use of R

I’ll start by determining the range of values present in the Intensity data:

cat("Min Intensity:", min(intensity_src_mins_tall$intensity),
    "\nMax Intensity:", max(intensity_src_mins_tall$intensity), "\n", sep = "")

The “activity_sum_days_wide” table provides some context clues as to what the levels might mean: the table summarises Intensity data into four new variables which, based on their names, appear to be associated with intensity levels like so:

  1. Sedentary Minutes
  2. Lightly Active Minutes
  3. Fairly Active Minutes
  4. Very Active Minutes

Let’s see if I can confirm this by recreating the data with that naming convention:

# Generate my version of sleepDay_merged for comparison with the original
intensity_daily_sum_wide <- intensity_src_mins_tall %>%
    mutate(activity_date_floored = floor_date(mdy_hms(ActivityMinute), unit = "days")) %>%
    group_by(Id, activity_date_floored) %>%
    summarize(
        minutes_sedentary      = sum(case_when(Intensity == 0 ~ 1, TRUE ~ 0)),
        minutes_lightly_active = sum(case_when(Intensity == 1 ~ 1, TRUE ~ 0)),
        minutes_fairly_active  = sum(case_when(Intensity == 2 ~ 1, TRUE ~ 0)),
        minutes_very_active    = sum(case_when(Intensity == 3 ~ 1, TRUE ~ 0))
    ) %>%
    mutate(Id_ActivityDate_UID = paste(Id, activity_date_floored, sep = "_")) %>%
    arrange(Id, activity_date_floored, Id_ActivityDate_UID)

# Compare both versions of the data and return any dates with different values
intensity_daily_comp <- activity_sum_days_wide %>%
    mutate(ActivityDate_floored = floor_date(mdy(ActivityDate), unit = "days")) %>%
    mutate(Id_ActivityDate_UID = paste(Id, ActivityDate_floored, sep = "_")) %>%
    with(merge(
        .,
        intensity_daily_sum_wide,
        by = c("Id_ActivityDate_UID"),
        all = TRUE
    )
    ) %>%
    mutate(diff_minutes_sedentary      = minutes_sedentary      - SedentaryMinutes     ) %>%
    mutate(diff_minutes_lightly_active = minutes_lightly_active - LightlyActiveMinutes ) %>%
    mutate(diff_minutes_fairly_active  = minutes_fairly_active  - FairlyActiveMinutes  ) %>%
    mutate(diff_minutes_very_active    = minutes_very_active    - VeryActiveMinutes    ) %>%
    select(
        Id_ActivityDate_UID,
        diff_minutes_sedentary,
        diff_minutes_lightly_active,
        diff_minutes_fairly_active,
        diff_minutes_very_active
    ) %>%
    filter(!(diff_minutes_sedentary == 0 & 
                 diff_minutes_lightly_active == 0 & 
                 diff_minutes_fairly_active == 0 & 
                 diff_minutes_very_active == 0)
    ) %>%
    arrange(Id_ActivityDate_UID)

# For this table, glimpse() shows enough to demonstrate the validity of the method
glimpse(intensity_daily_comp)
rm(intensity_daily_sum_wide, intensity_daily_comp)

The approach above appears to work perfectly, with the exception of the “sedentary minutes” calculation, which is consistently higher in my version.

I think its safe to conclude that I got the mapping correct, given that: a) It’s consistently higher by at least 6 hours, which I suspect is caused by my method counting time asleep as “sedentary minutes” - which is not technically wrong, you know - and b) Reversing the order of the mapping produces entirely wrong results.

That being said, I still don’t know if these are the categories FitBits work in, not another naming convention that the data authors came up with, and I don’t know if all FitBit models work this way, so let’s do something I should have done from the start: read the manuals.

Validating “Exercise Intensity” by reading of manuals

Activity trackers like FitBits detect activity intensity partly by measuring the user’s heart rate while exercising: a higher heart rate corresponds with a higher degree of exertion. As of April 2016, the three latest FitBit models with heart-rate tracking were:

A quick look through the product manuals for each model confirms they all break down user activity into four default heart-rate zones:

Product HR Zone 1 HR Zone 2 HR Zone 3 HR Zone 4
FitBit Blaze “Out of Zone” “Fat burn” “Cardio” “Peak”
FitBit Charge HR “Out of Zone” “Fat burn” “Cardio” “Peak”
FitBit Surge “Out of Zone” “Fat burn” “Cardio” “Peak”

While these aren’t exactly the same terms as used in the data set, they’re clearly related - “Out of Zone” equates to “Sedentary”, for example.

All three FitBit manuals also make the same claim that the default zones are “based on American Heart Association recommendations”. Even without validating that claim, it indicates to me that the reasoning behind each zone is not arbitrary, and is consistent across devices, so I think I can assume any other FitBit models circa 2016 would follow the same classification scheme.

At this point, I’m satisfied that the below are the only four intensity levels I need to consider when analysing the data set, regardless of what models of FitBits were being used:

  1. Sedentary Minutes
  2. Lightly Active Minutes
  3. Fairly Active Minutes
  4. Very Active Minutes

Validating “Sleep Quality” by use of R and reading of manuals

As with exercise intensity, I start by determining the range of values present in the data:

cat("Min Sleep:", min(sleep_src_mins_tall$value),
    "\nMax Sleep:", max(sleep_src_mins_tall$value), "\n", sep = "")

The values for sleep quality range from 1 to 3. The summary data tables for sleep quality introduce only two new variable names: “Total Time In Bed”, and “Total Minutes Asleep”. There’s not a valid name for each level of factor like there was for exercise, so in this case we go straight back to the manuals:

  • Blaze: tracks “both your time spent asleep and your sleep quality”
  • Charge HR: tracks “the hours you sleep and your movement during the night”
  • Surge: tracks “the hours you sleep and your movement during the night”

Further poking around the FitBit help pages on how to track sleep stats and what they all mean reveals that different devices track slightly different data if they have heart rate tracking:

  • No HR tracking: Generic sleep quality tracking with “Time spent awake, restless, and asleep” categories
  • HR tracking: Sleep stage tracking with “Light Sleep, Deep Sleep, and REM Sleep” stages

The help pages also single out the Charge HR and the Surge as the only HR-tracking FitBits to not have full sleep stage tracking, leaving the Blaze as the only device from this time period with that feature. Blaze aside, motion-based sleep quality tracking appears to go all the way back to the FitBit One. Given this information, it seems fair to assume the following mapping for the sleep data:

  • 1: Awake
  • 2: Restless
  • 3: Asleep

I can confirm this mapping by attempting to recreare duplicate the summary data in sleepDay_merged:

As with the intensity data, I can confirm this by recreating the data with that naming convention:

# Generate my version of sleepDay_merged for comparison with the original
sleep_src_mins_tall_NW <- sleep_src_mins_tall %>%
    # mutate(date_typed = mdy_hms(date)) %>%
    mutate(date_floored = floor_date(mdy_hms(date), unit = "days")) %>%
    # Sum time asleep for each Log ID
    group_by(logId) %>%
    summarize(
        "Id" = min(Id),
        # Associate each Log ID with the latest date recorded under it
        "SleepDay" = max(date_floored),
        "minutes_in_bed"   = n(),
        "minutes_awake"    = sum(case_when(value == 3 ~ 1, TRUE ~ 0)),
        "minutes_restless" = sum(case_when(value == 2 ~ 1, TRUE ~ 0)),
        "minutes_asleep"   = sum(case_when(value == 1 ~ 1, TRUE ~ 0))
    ) %>%
    # Sum time asleep for each date based on SleepDay
    group_by(Id, SleepDay) %>%
    summarize(
        "TotalSleepRecords_2" = n(),
        "TotalMinutesAsleep_2" = sum(minutes_asleep),
        "TotalTimeInBed_2" = sum(minutes_in_bed),
        "TotalMinutesAwake" = sum(minutes_awake),
        "TotalMinutesRestless" = sum(minutes_restless),
    ) %>%
    mutate("Id_SleepDay_UID" = paste(Id, SleepDay, sep = "_")) %>%
    arrange(Id_SleepDay_UID)

# Compare both versions of the data and return any dates with different values
sleepDay_comp <- sleep_sum_days_wide %>%
    # mutate("SleepDay_typed" = mdy_hms(SleepDay)) %>%
    mutate("SleepDay_floored" = floor_date(mdy_hms(SleepDay), unit = "days")) %>%
    mutate("Id_SleepDay_UID" = paste(Id, SleepDay_floored, sep = "_")) %>%
    arrange(Id_SleepDay_UID) %>%
    with(merge(
        .,
        sleep_src_mins_tall_NW,
        by = c("Id_SleepDay_UID"),
        all = TRUE
    )
    ) %>%
    mutate(recordDiff = TotalSleepRecords_2 - TotalSleepRecords) %>%
    mutate(sleepDiff = TotalMinutesAsleep_2 - TotalMinutesAsleep) %>%
    mutate(bedDiff = TotalTimeInBed_2 - TotalTimeInBed) %>%
    select(
        Id_SleepDay_UID,
        recordDiff,
        sleepDiff,
        bedDiff
    ) %>%
    filter(!(recordDiff == 0 & sleepDiff == 0 & bedDiff == 0)) %>%
    arrange(Id_SleepDay_UID)

# For this table, glimpse() shows enough to demonstrate the validity of the method
glimpse(sleepDay_comp)
rm(sleep_src_mins_tall_NW, sleepDay_comp)

I was able to recreate the existing sleep_src_mins_tall table almost perfectly by summing sleep times per Log ID, with the latest date associated with each Log ID being used as the “date” for that sleep. In practice it turns out I had the mapping inverted, and actually the following is used:

  • 1: Asleep
  • 2: Restless
  • 3: Awake

So I guess read that as “1 is highest-quality sleep, 3 is worst-quality”.

My version of the data contains a few rows that vary slightly from the original, by 1 to 22 minutes. I haven’t been able to determine the source of this error, but they’re more than close enough to confirm I don’t have the factor level mapping backwards, so I’m ready to proceed.

Duplicate data between tables

The data set includes some tables that contain source data for a given feature, e.g. heart-rate tracking, and others that contain summary data, e.g. everything in the “activity_days_sum_wide” table.

Some of these are useful, for instance:

  • dailyActivities_merged.csv calculates Sedentary Minutes by excluding time spent asleep. This either requires clever/time-consuming cross-referencing with other tables, or is raw data from a calculation performed on the FitBit itself: either way, I don’t want to have to do it again
  • sleepDay_merged.csv summarises sleep on a per-night basis, which is more useful for comparing users than the raw, minute-by-minute sleep data

Others are not useful, for instance:

  • calories_sum_mins_wide.csv, intensity_sum_mins_wide.csv, and steps_sum_mins_wide.csv all just pivot minute-level data into one 60-column row per hour containing the same data
  • minuteIntensitiesWide_merged.csv just sums and averages intensity per hour

Those tables that do not provide useful summaries can be excluded from the data analysis: if a specific need is found for their data, they can be reloaded or recreated manually as required.

Duplicate data within tables

The “bodycomp_logs_src_wide.csv” file contains both kilogram and pound variables: these describe the same information, and all of the observations contain values for both variables, so one variable can be dropped with no loss of data. The choice between the two formats seems arbitrary for my analysis, so I’m choosing to keep the kilos data as its expressed in an SI unit.

intensity_sum_hours_wide.csv contains Total Intensity and Average Intensity. Total intensity values exceed 4, the maximum for intensity, and so are not actually useful. Average Intensity is just the Total Intensity for each hour divided by 60 minutes per hour.

Step 3: Clean

TODO: Data Cleaning Checklist

  • Convert list into changelog format

Update file names

For this analysis, I’ll rename the files to use this naming convention:

[feature]_[src/sum]_[interval]_[shape].[filetype]

Applying the naming convention to the data set yields the following file names:

Original Updated
dailyActivity_merged.csv activity_sum_days_wide.csv
dailyCalories_merged.csv calories_sum_days_tall.csv
dailyIntensities_merged.csv intensity_sum_days_wide.csv
dailySteps_merged.csv steps_sum_days_tall.csv
heartrate_seconds_merged.csv heartrate_src_seconds_tall.csv
hourlyCalories_merged.csv calories_sum_hours_tall.csv
hourlyIntensities_merged.csv intensity_sum_hours_wide.csv
hourlySteps_merged.csv steps_sum_hours_tall.csv
minuteCaloriesNarrow_merged.csv calories_src_mins_tall.csv
minuteCaloriesWide_merged.csv calories_sum_mins_wide.csv
minuteIntensitiesNarrow_merged.csv intensity_src_mins_tall.csv
minuteIntensitiesWide_merged.csv intensity_sum_mins_wide.csv
minuteMETsNarrow_merged.csv mets_src_mins_tall.csv
minuteSleep_merged.csv sleep_src_mins_tall.csv
minuteStepsNarrow_merged.csv steps_src_mins_tall.csv
minuteStepsWide_merged.csv steps_sum_mins_wide.csv
sleepDay_merged.csv sleep_sum_days_wide.csv
weightLogInfo_merged.csv bodycomp_src_logs_wide.csv

The conversion was performed manually on my local device.

Check for missing data

Each data frame was checked for NULL values and non-null empty strings.

The only table with missing data was “bodycomp_src_logs_wide”, which was missing all but two of the data points for Body Fat Percentage. Given this lack of data, Body Fat Percentage will be excluded from the data analysis.

# Check for NULLs
cat("Checking for NULL/empty values...\n", sep="")
for (df_name in df_names) {
  df <- get(df_name)
  num_nulls <- sum(is.na(df))
  if(num_nulls) {
    cat(df_name,": ",num_nulls," NULLs","\n",sep="")
  }
  # Check for empty strings (which do not show up as NULLs)
  empty_strings <- df %>%
    filter(if_any(where(is.character), ~ nchar(.) == 0))
  num_empty_strings = nrow(empty_strings)
  if(num_empty_strings) {
    cat(df_name,": ",num_empty_strings," empty strings","\n",sep="")
  }
}
cat("Checking for NULL/empty values done.\n", sep="")
rm(df, df_name, num_nulls, empty_strings, num_empty_strings, df_name)
  • Looked for NULLs and empty strings. The “bodycomp_src_logs_wide” table was found to be missing the majority of the entries for the body-fat percentage variable: this variable has been excluded from the analysis.

Clean and update variable names

Functions and environment variables used in the code below have been omitted for brevity and can be found in Appendix A.

## Rename Variables ----

cat("Cleaning variable names...\n", sep = "")
for(df_name in df_names) {
  cat("Cleaning ",df_name,"...\n", sep = "")
  assign(df_name, get(df_name) %>% clean_names())
}
cat("Cleaning variable names complete.\n", sep = "")

cat("Renaming variables...\n", sep = "")
var_mods_rename <- var_mods %>%
  filter(var_old != "" & var_new != "")
for(df_name in df_names) {
  assign(df_name, rename_df_variables(df_name, var_mods_rename))
}
rm(df_name)
cat("Renaming variables complete.\n", sep = "")

Remove duplicate data

All dataframes were checked for duplicate rows using the anyDuplicated() function, and duplicate rows were removed using the distinct() function. The function was tested using a prototype version that generated dataframes of all of the apparent duplicate rows: these were verified manually before the function was allowed to modify the actual dataframes. The logic was verified again by re-running it after the initial removal: all dataframes reported zero duplicates on the second run, confirming the success of the first pass.

cat("Checking for duplicate values...\n",sep="")
for (df_name in df_names) {
  cat("Checking ",df_name," for duplicates... ",sep="")
  df <- get(df_name)
  if (!anyDuplicated(df)) {
    cat("0 duplicates removed. Done.\n",sep="")
  } else {
    nrow_before <- nrow(df)
    df <- distinct(df)
    nrow_after <- nrow(df)
    cat("Removing ",(nrow_before - nrow_after)," duplicates... ",sep="")
    assign(df_name, df)
    cat("Done.\n",sep="")
  }
}
rm(df, df_name, nrow_before, nrow_after)
cat("Checking for duplicate values complete.\n",sep="")

Recast mismatched variable data types

Variable Original Type Updated Type Reason
activity_day chr datetime Cannot perform datetime operations on chr variables
activity_hour chr datetime Cannot perform datetime operations on chr variables
activity_minute chr datetime Cannot perform datetime operations on chr variables
date chr datetime Cannot perform datetime operations on chr variables
id num chr Disable numerical operations (IDs are considered a UID string)
log_id num chr Disable numerical operations (IDs are considered a UID string)
time chr datetime Cannot perform datetime operations on chr variables
sleep_rank num factor 1:3 Disable numerical operations (value is a ranking, not an amount)
intensity num factor 0:3 Disable numerical operations (value is a ranking, not an amount)
# Global Variable Declarations ----

var_mods_recast <- var_mods %>%
  filter(type_new != "")

# Function Declarations ----

are_identical_lists <- function(list1, list2) {
  if (length(list1) != length(list2)) {
    return(FALSE)
  }
  for (i in seq_along(list1)) {
    if(!identical(list1[[i]], list2[[i]])) {
      cat("Non-identical lists at list1[",i,"]. Exiting.\n", sep = "")
      print(list1[[i]])
      print(list2[[i]])
      rm(i)
      return(FALSE)
    }
  }
  rm(i)
  return(TRUE)
}

get_df_var_types <- function(df_name) {
  cat("Getting current variable types for ", df_name, "... ", sep = "")
  df <- get(df_name)
  var_types <- data.frame(
    var = names(df),
    type = sapply(df, function(col) class(col)[1])
  )
  cat("Done.\n", sep = "")
  return(var_types)
}

get_df_target_var_types <- function(df_name) {
  cat("Getting target variable types for ", df_name, "...\n", sep = "")
  var_types <- get_df_var_types(df_name)
  # Iterate over rows in current var_types
  for (var_row in 1:nrow(var_types)) {
    # Check if var name is in var_mods_recast
    var_name <- var_types$var[var_row]
    for(mods_row in 1:nrow(var_mods_recast)) {
      var_name_mods <- var_mods_recast$var_new[mods_row]
      # If no, skip: if yes, replace type with target
      if(var_name == var_name_mods) {
        type_new <- var_mods_recast$type_new[mods_row]
        cat("Updated \"", var_name, "\" from ", var_types$type[var_row], " to ", var_mods_recast$type_new[mods_row], ".\n", sep = "")
        var_types$type[var_row] <- type_new
      }
    }
  }
  cat("Getting target variable types for ", df_name, " done.\n", sep = "")
  return(var_types)
}

recast_variables <- function(df_name) {
  df <- get(df_name)
  for (i in 1:nrow(var_mods_recast)) {
    var_new <- var_mods_recast$var_new[i]
    type_new <- var_mods_recast$type_new[i]
    if (var_new %in% colnames(df)) {
      cat("Converting ",df_name,"$",var_new," to ",type_new, "... ", sep = "")
      if (type_new == "character") {
        df <- df %>% mutate("{var_new}" := as.character(!!sym(var_new)))
      } else if (type_new == "Date") {
        df <- df %>% mutate("{var_new}" := mdy(!!sym(var_new)))
      } else if (type_new == "POSIXct") {
        df <- df %>% mutate("{var_new}" := mdy_hms(!!sym(var_new)))
      } else {
        cat("type_new not found: not converting.", sep = "")
      }
      cat("Done.\n", sep = "")
    }
  }
  rm(i)
  return(df)
}

# Recast Variables ----

cat("Generating list of target column types for testing...\n", sep = "")
df_types_target <-lapply(df_names, get_df_target_var_types)
names(df_types_target) <- df_names
cat("Generating list of target column types complete.\n", sep = "")

cat("Recasting variables...\n", sep = "")
for (df_name in df_names) {
  cat("Recasting ",df_name,"...\n", sep = "")
  assign(df_name, recast_variables(df_name))
  cat("Converting ",df_name," complete.\n", sep = "")
}
cat("Recasting variables complete.\n", sep = "")

# Test Recasting of Variables ----

cat("Generating list of updated column types...\n", sep = "")
df_types_after <- lapply(df_names, get_df_var_types)
names(df_types_after) <- df_names
cat("Generating list of updated column types complete.\n", sep = "")

cat("Checking updated column types against target types...\n", sep = "")
test_succeeded <- are_identical_lists(df_types_after, df_types_target)
cat("Checking updated column types against target types complete.\n", sep = "")
cat("Data recasting ", case_when(test_succeeded ~ "succeeded", TRUE ~"failed"), ".", sep = "")
rm(df_name, df_types_target, df_types_after, test_succeeded)
# Individual recast of variables with only one occurrence
sleep_src_mins_tall <- sleep_src_mins_tall %>%
    mutate(sleep_rank = factor(sleep_rank, levels = 1:3, labels = c("Asleep", "Restless", "Awake")))

intensity_src_mins_tall <- intensity_src_mins_tall %>%
    mutate(intensity = factor(intensity, levels = 0:3, labels = c("Sedentary", "Lightly Active", "Fairly Active", "Very Active")))

Given the large number of variables in the data set, the recasting procedure includes test code to confirm the updated variable types match the types specified in the variable mods list. The test code works by first generating a list of the desired final variable/type pairs, then, once the conversion is completed, generating a second list of the actual variable/type pairs in the data frames to compare it to. This has the advantage of not just confirming the desired conversions took place, but also checks for any unintended changes to variables that did not require conversion.

Building the test code added a decent amount of work to the project, but now I have it working, it can be reused and scaled to future projects.

Trim leading or trailing characters

  • Manual check for variable names with non-whitespace trailing characters
  • Automated trim of variable names
  • Automated trim of character-type data entries
cat("Trimming whitespace in variable names...\n", sep="")
for (df_name in df_names) {
  df <- get(df_name)
  for (col_name in colnames(df)) {
    col_name_trimmed <- str_trim(col_name)
    cat("Trimming ",df_name, "[",col_name,"] to ",col_name_trimmed,"... ",sep="")
    df <- df %>% rename(!!col_name := !!col_name_trimmed)
    cat("Done.\n", sep = "")
  }
}
rm(df, df_name, col_name, col_name_trimmed)
cat("Trimming whitespace in variable names complete.\n", sep="")
trim_chr_column <- function(col) {
  if (is.character(col)) {
    return(str_trim(col))
  } else {
    return(col)
  }
}

cat("Trimming whitespace in character-type data values...\n", sep="")
for (df_name in df_names) {
  df <- get(df_name)
  for (col_name in colnames(df)) {
    if (is.character(df[[col_name]])) {
      cat("Trimming ",df_name, "[",col_name,"]... ",sep="")
      # df <- df %>% mutate(if_any(where(is.character), trim_chr_column))
      df <- df %>% mutate(across(all_of(col_name), str_trim))
      cat("Done.\n", sep = "")
    }
  }
}
rm(df, df_name, col_name)
cat("Trimming whitespace in character-type data values complete.\n", sep="")

Validate numeric data

Given that the data was not entered manually by the user, there’s no way for me to manually check the correctness of all of the numeric values included in the data set. The values were instead checked against pre-determined limits to verify that they fall within realistic ranges, as detailed below.

Validating the data in this way also helps confirm the data makes sense in terms of the business logic, by confirming the data falls within realistic ranges given the capabilities of the devices and the types of data they claim to track.

ID lengths and value ranges are correct

I wrote a short function to validate the length of all rows in a data frame for a given column number and valid length. This was then used to check ID values in the data set against their correct length, including:

  • “id” values: 10-digit
  • “sleep_log_id” values: 11-digit
  • “bodycomp_log_id” values: 13-digit

The validation confirmed all values were the correct length.

validate_string_length <- function(df_name, col_name, valid_length) {
  cat("Checking ",df_name," for invalid ",col_name," values... ",sep="")
  df <- get(df_name)
  if (col_name %in% colnames(df)) {
    invalid_values <- df %>% select(!!sym(col_name)) %>% filter(nchar(!!sym(col_name)) != valid_length)
  }
  if (exists('invalid_values') && nrow(invalid_values) > 0) {
    cat(nrow(invalid_values)," invalid values found:\n",sep="")
    glimpse(invalid_values)
    rm(invalid_values)
  } else {
    cat("complete.\n",sep="")
  }
  rm(df)
}

result <- map(df_names, ~validate_string_length(.x, col_name = "id", valid_length = 10))
result <- map(df_names, ~validate_string_length(.x, col_name = "sleep_log_id", valid_length = 11))
result <- map(df_names, ~validate_string_length(.x, col_name = "bodycomp_log_id", valid_length = 13))
rm(result)

To further validate the large number of ID data points in the data set, I wrote a short function to check all ID variables in each data frame and determine the minimum and maximum values. The intent was to identify any possible erroneous values within the acceptable length limits, e.g. all zeroes or all nines. The function makes use of the direct conversion of strings to numerals in the min() and max() functions to carry out the comparison on character-type values.

The validation confirmed all values were within a realistic range.

cat("Checking min/max ID value ranges...\n",sep="")
id_col_names <- c("id", "sleep_log_id", "bodycomp_log_id")
for (df_name in df_names) {
  df <- get(df_name)
  for (col_name in id_col_names) {
    if(col_name %in% colnames(df)) {
      cat(df_name,"[",col_name,"] Min = ",min(df[[col_name]])," Max = ",max(df[[col_name]]),".\n",sep="")
    }
  }
}
rm(df, df_name, col_name, id_col_names)
cat("Checking min/max ID value ranges complete.\n",sep="")

Dates are all within range

The data set is described as containing data from users collected between “03.12.2016-05.12.2016”: all records in the data set were validated against this date range.

validate_dates <- function(df_name) {
  valid_dates_stt <- as.Date("2016-03-12", format = "%Y-%m-%d")
  valid_dates_end <- as.Date("2016-05-14", format = "%Y-%m-%d")
  
  cat("Validating dates for ",df_name,"... ",sep="")
  
  date_columns <- get(df_name) %>%
    select_if(function(col) is.POSIXct(col) || is.Date(col))
  
  if (ncol(date_columns) == 0) {
    cat("Error: no date column found.\n",sep="")
  } else if (ncol(date_columns) > 1) {
    cat("Error: more than one date column found for ",df_name,".\n",sep="")
  } else {
    outside_range <- date_columns %>%
      filter(if_any(everything(), ~ . < valid_dates_stt | . > valid_dates_end))
    if (nrow(outside_range) > 0) {
      cat("found ",nrow(outside_range)," invalid values:\n",sep="")
      print(outside_range)
      rm(outside_range)
    } else {
      cat("complete.\n")
    }
  }
  rm(date_columns)
}

result <- lapply(df_names, validate_dates)
rm(result)

This check found dates just outside the range, dating up to 8am on 05.13.2016, the day after the data set supposedly ended. I didn’t consider this to be a problem, so the valid end-date was updated to the 14th of May 2016 accordingly, and all dates passed this check.

Non-date values are within appropriate ranges for their units

Checklist: - All numeric values are non-negative - Percentage values are less than 100 - Weight values are positive and make sense (e.g. less than 200kg) - BMI values are in range - Daily, hourly, and minute duration sums are no more than one day, hour, or minute, respectively - Distances make sense - Step counts make sense (check for >20000 for a start) - Calories are within normal range - Heart rates are less than 200

First, a check for negative values was run on all numeric columns in the data set.

validate_numerics <- function(df_name) {
  # This function performs all checks on numeric values that are required in more than one data-frame, e.g. non-negativity and summation
  cat("Validating numerics for ",df_name,"... ",sep="")
  numerics <- get(df_name) %>% select_if(is.numeric)
  if (ncol(numerics) == 0) {
    cat("No numeric variables found.\n",sep="")
  } else {
    # Check for negative values
    negative_values <- numerics %>%
      filter(if_any(everything(), ~ . < 0))
    if (nrow(negative_values) > 0) {
      cat("found ",nrow(negative_values)," invalid values:\n",sep="")
      print(negative_values)
    } else {
      cat("complete.\n")
    }
    rm(negative_values)
    # Check for summation
  }
  rm(numerics)
}

result <- lapply(df_names, validate_numerics)
rm(result)

Results:

  • No negative values were found in the data.

Second, variable-specific checks were run to confirm the data fell within realistic ranges given my understanding of the variables being measured.

Note: The maximum values given are used as thresholds above which values may not be realistic, not as hard limits for acceptability: a heart-rate of 200bpm, for example, is entirely possible, but it is high enough that I would want to check if the data point corresponded to a period of high-intensity exercise.

# For a given list of dfs, column names, and a min-max range, check all matching columns in all matching dfs against that range
validate_within_range <- function(df_name, column_names, range_min, range_max) {
  df <- get(df_name)
  for (col_name in column_names) {
    cat("Checking ranges for ",df_name,"[",col_name,"]... ",sep="")
    if (!(col_name %in% colnames(df))) {
      cat("column not found.\n")
    } else {
      out_of_range <- df %>%
        filter(!!sym(col_name) < range_min | !!sym(col_name) > range_max)
      if (nrow(out_of_range) <= 0) {
        cat("complete.\n",sep="")
      } else {
        cat("Found ",nrow(out_of_range)," out-of-range values:\n",sep="")
        glimpse(out_of_range)
      }
      rm(out_of_range)
    }
  }
rm(df, col_name)
}

# Minute summation
valid_df_names <- c(
  "activity_sum_days_wide",
  "intensity_sum_days_wide",
  "sleep_sum_days_wide")
valid_col_names <- c(
  "minutes_very_active",
  "minutes_fairly_active",
  "minutes_lightly_active",
  "minutes_sedentary",
  "minutes_asleep_total",
  "minutes_in_bed_total")
range_max <- 60 * 12 # Minutes in half a day
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))

# Weights (kg)
valid_df_names <- c("bodycomp_src_logs_wide")
valid_col_names <- c("weight_kg")
range_max <- 150 # Arbitrarily chosen as high enough to be a potentially erroneous value
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))

# Weights (pounds, same limit as for weight in kilos)
valid_df_names <- c("bodycomp_src_logs_wide")
valid_col_names <- c("weight_pounds")
kg2lb <- 2.204623
range_max <- 150 * kg2lb # Arbitrarily chosen as high enough to be a potentially erroneous value
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))
rm(kg2lb)

# BMI 
valid_df_names <- c("bodycomp_src_logs_wide")
valid_col_names <- c("bmi")
range_max <- 40.0 # Corresponds with the WHO "Obese (Class III)" weight category
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))

# Distances
valid_df_names <- c(
  "activity_sum_days_wide",
  "intensity_sum_days_wide")
valid_col_names <- c(
  "distance_lightly_active",
  "distance_logged_activities",
  "distance_moderately_active",
  "distance_sedentary",
  "distance_total",
  "distance_tracker",
  "distance_very_active")
range_max <- 21.08241 # Equivalent to one half-marathon
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))

# Step counts
valid_df_names <- c(
  "activity_sum_days_wide",
  "steps_sum_days_tall",
  "steps_sum_hours_tall",
  "steps_src_mins_tall")
valid_col_names <- c(
  "steps",
  "steps_total")
range_max <- 14800 # Set to double the average daily step count for Australian adults
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))

# Calories
valid_df_names <- c(
  "activity_sum_days_wide",
  "calories_src_mins_tall",
  "calories_sum_days_tall",
  "calories_sum_hours_tall")
valid_col_names <- c("calories")
range_max <- 4000 # Chosen arbitrarily as double the typically-recommended daily caloric intake
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))

# TODO: METs
valid_df_names <- c("mets_src_mins_tall")
valid_col_names <- c("mets")
range_max <- 12 # Equivalent to vigourous squash playing
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))

# Heart Rates
valid_df_names <- c("heartrate_src_seconds_tall")
valid_col_names <- c("heart_rate")
range_max  <- 200 # Chosen based on average 100% heart-rate for a 20 y.o. (Source: American Heart Foundation)
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))

rm(range_max, result, valid_df_names, valid_col_names)

Results:

  • There were no invalid negative values
  • There were no invalid percentage values
  • There were no invalid minute summations
  • There were 19 step count data points above 20,000: these were manually checked and found to be associated with long distances covered and high levels of exercise, which made sense, so I considered that data valid.
  • There were a total of four unique data points with a distance covered of more than one half-marathon: all of these also indicatd a very high level of exercise for at lest 90 minutes, which made sense, so I considered the data valid
  • Calories: 21 records showed caloric burns of over 4000 calories. All but one of those were associated with very high levels of activity: the outlier showed only 30 minutes total as “Very Active” or “Lightly Active”, compared to 120+ minutes for all other records. It does show 15km distance covered that day, which is nearly a third of a marathon, so for now it makes enough sense to keep it in, and I will analyse it further after the cleaning stage.
  • Heart-rate: 13 records were found with a heart-rate between 200 and 203, all of which corresponded to a single user over a 30-minute period of “Very Active” intensity.
  • METs: Every value was out of range for the original limit of 12 METs, chosen as it was the highest listed MET value for any activity in the source data. The smallest repeated value in the data was 10. This suggests my interpretation of the data logged for this variable is wrong.

Validating the METs data

  • In theory, METs are a rate of energy expenditure: if you expend energy at a rate of 3 METs, that value doesn’t change whether you maintain it for a minute or an hour
  • The highest MET value for any given activity in the sources I used was 12, for sustained “heavy” squash playing
  • My initial assumption was that the MET values in the data were the average MET rate for the sampling period, in this case one minute
  • The data points, however, were listing MET values as high as 157
  • This suggested the values were not averages but rather some sum or projected sum, e.g. projected MET-hours based on the average over the minute.
  • Searching through the product manuals again revealed no information on METs at all. A Help article was the only other official reference to METs that I could find: it implies that METs are used to calculate “Active Minutes”, and that Active Minutes count double in higher heart-rate zones, but no specific mathematical relationship between the two was described.

With this in mind, I attempted to corroborate the METs data by plotting it against other related data, specifically calories-burned and heart-rate. Individual charts were plotted for each user ID.

# Generate minute-level summary heart-rate data for merging with calorie/
heartrate_sum_minutes_tall <- heartrate_src_seconds_tall %>% 
  mutate(activity_minute = floor_date(heart_rate_second, unit = "minute")) %>%
  group_by(id, activity_minute) %>%
  summarize(heart_rate_mean = mean(heart_rate))

mets_vs_heart_rate <- mets_src_mins_tall %>%
  merge(., intensity_src_mins_tall, by = c("id", "activity_minute")) %>%
  merge(., heartrate_sum_minutes_tall, by = c("id", "activity_minute"))

mets_vs_calories <- mets_src_mins_tall %>%
  merge(., intensity_src_mins_tall, by = c("id", "activity_minute")) %>%
  merge(., calories_src_mins_tall, by = c("id", "activity_minute"))

plot_mets_vs_heartrate <- ggplot(data = mets_vs_heart_rate) +
  geom_point(mapping=aes(x=heart_rate_mean, y=mets, shape=intensity, color=intensity)) +
  geom_smooth(mapping=aes(x=heart_rate_mean, y=mets), method = "lm", se = FALSE, color = "blue") +
  stat_cor(aes(x=heart_rate_mean, y=mets, label=..rr.label..), label.x=0, label.y=50) +
  facet_wrap(vars(id))
print(plot_mets_vs_heartrate)

plot_mets_vs_calories <- ggplot(data = mets_vs_calories) +
  geom_point(mapping=aes(x=calories,y=mets, shape=intensity, color=intensity)) +
  # geom_smooth(mapping=aes(x=calories,y=mets), method = "lm", se = FALSE, color = "blue") +
  stat_cor(aes(x=calories, y=mets, label=..rr.label..), label.x=0, label.y=50) +
  facet_wrap(vars(id))
print(plot_mets_vs_calories)

plot_mets_vs_intensity <- ggplot(data = mets_vs_calories) +
  geom_point(mapping=aes(x=intensity,y=mets)) +
  # geom_smooth(mapping=aes(x=calories,y=mets), method = "lm", se = FALSE, color = "blue") +
  #stat_cor(aes(x=calories, y=mets, label=..rr.label..), label.x=0, label.y=50) +
  facet_wrap(vars(id))
print(plot_mets_vs_intensity)

Results:

  • There was a clear positive correlation between heart-rate and METs, with R^2 values ranging from 0.43 to 0.82
  • A much tighter positive correlation was present between calories and METs, with R^2 values of 1 across all user ID values.

Given the very tight relationship between METs and calories, and the unclear sampling method used to generate the MET data, I decided to keep the METs data as-is, and consider all values as valid for the purposes of cleaning..After cleaning the data, further analysis of the MET data may shed more light on its meaning: in the meantime, the calories, heart-rate, and intensity variables can be used to analyse energy expenditure and activity type.

TODO Step 4: Analyse

Method

Use the advertised features of the Ivy to guide the analysis

Function Declarations

These are generic functions I developed for use during the analysis. They are declared here to be made available to subsequent code chunks.

round_data_to_bin <- function(data, bin_width) {
  rounding_factor <- 1 / bin_width
  return(round(data * rounding_factor) / rounding_factor)
}

get_histogram_max_count <- function(data, bin_width) {
  rounding_factor <- 1 / bin_width
  # Round data to nearest multiple of bin_width
  data_rounded <- round_data_to_bin(data, bin_width)
  max_count <- max(table(data_rounded))
  return(max_count)
}

rescale_plot <- function(p, x_min = 0, x_max = 10, x_step = 1, y_min = 0, y_max = 10, y_step = 1) {
  p <- p +
    scale_x_continuous(breaks = seq(x_min, x_max, by=x_step),
                       labels = scales::comma_format(),
                       limits=c(x_min, x_max)) +
    scale_y_continuous(breaks = seq(y_min, y_max, by=y_step),
                       labels = scales::comma_format(),
                       limits=c(y_min, y_max))
  return(p)
}
plot_histo_pareto <- function (data, bin_width) {
  # Cumulative data uses same bins as histogram
  data <- sort(data)
  data_rounded <- round_data_to_bin(data, bin_width)
  
  # Calculate highest count so axes can be adjusted
  highest_count <- get_histogram_max_count(data, bin_width)
  data_pareto <- seq(1, length(data), by = 1) / length(data) * highest_count

  # Calculate stats for overlay on histogram
  data_max <- max(data)
  data_mean <- round(mean(data), digits = 2)
  data_median <- round(median(data), digits = 2)
  sum_stats <- data.frame(Statistics = c("Mean", "Median"),
                          value = c(data_mean, data_median))
  label_text <- paste("Mean =", data_mean, ", Median =", data_median)

  p <- ggplot() +
    geom_histogram(aes(x = data),
                   color = "white",
                   binwidth = bin_width) +
    geom_line(aes(x = data_rounded,
                  y = data_pareto),
                  color="forestgreen") +
    geom_vline(data = sum_stats,
               aes(xintercept = value,
                   linetype = Statistics,
                   color = Statistics),
               size = 1) +
    scale_x_continuous(breaks = seq(0, data_max + bin_width, by = bin_width)) +
    scale_y_continuous(name = "Users",
                       breaks = seq(0, highest_count, by = 1),
                       sec.axis = sec_axis(~./highest_count, name = "Cumulative Percentage of Users")) +
    # annotate("text", x = Inf, y = Inf, label = label_text,
    #        hjust = 1, vjust = 1, size = 4) +
    labs(x = "Value",
         caption = label_text)
  return(p)
}
get_time_coefficients <- function(ids, timestamps, values) {
  tbl <- tibble(
    id = ids,
    timestamp = timestamps,
    value = values
  )

  coeffs <- tbl %>%
    mutate(day_of_year = yday(timestamp)) %>%
    group_by(id) %>%
    summarize(correlation = cor(day_of_year, value))

  return(coeffs)
}

plot_time_coefficients <- function(coeffs) {
  bin_width <- 0.1
  max_count <- get_histogram_max_count(coeffs$correlation, bin_width)
  
  coeffs_mean <- round(mean(coeffs$correlation, na.rm = TRUE), digits = 2)
  coeffs_median <- round(median(coeffs$correlation, na.rm = TRUE), digits = 2)
  sum_stats <- data.frame(Statistics = c("Mean", "Median"),
                          value = c(coeffs_mean, coeffs_median))
  label_text <- paste("Mean =", coeffs_mean, "\nMedian =", coeffs_median)
  
  ggplot(coeffs, aes(x = correlation)) +
    geom_histogram(binwidth = bin_width, color = "white") +
    geom_vline(data = sum_stats,
               aes(xintercept = value,
                   linetype = Statistics,
                   color = Statistics),
               size = 1) +
    scale_x_continuous(breaks = seq(-1.0, 1.0, by=bin_width)) +
    scale_y_continuous(breaks = seq(0, max_count, by=1)) +
    annotate("text", x = Inf, y = Inf, label = label_text,
           hjust = 1, vjust = 1, size = 4) +
    labs(title = "Correlations with Time",
         caption = "Positive values imply variable of interest generally increased over time.",
         x = "Coefficient of Correlation",
         y = "Count")
}

get_time_coefficients_plot <- function(ids, timestamps, values) {
  coeffs <- get_time_coefficients(ids, timestamps, values)
  p <- plot_time_coefficients(coeffs)
  return(p)
}
wide_to_stacked_bar_plot <- function(data_wide, key, value, key_order) {
  if(!("id" %in% colnames(data_wide))) {
    print("ERROR: data does not include \"id\" column: cannot convert.")
  } else {
    # Convert the data from wide to long. Set the factor levels to control the stacking order of the bars
    data_long <- data_wide %>%
      tidyr::gather(key = !!sym(key), value = !!sym(value), -id) %>%
      mutate(!!sym(key) := factor(!!sym(key), levels = key_order))

    # Order IDs in wide data based on value of first key, then rearrange long data.
    # This ensures the resultant plot sorts the IDs in ascending order of the first key
    first_key <- key_order[1]
    data_wide <- data_wide %>% arrange(!!sym(first_key))
    data_long$id <- factor(
      data_long$id,
      levels = data_wide$id)
    
    # Plot graph with angled ID labels
    p <- ggplot(data_long, aes(x= id, y= !!sym(value), fill= !!sym(key))) +
      geom_bar(stat = "identity") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))
    # scale_y_continuous(breaks = seq(0, 24, by = 1))
  }
}
scatter_with_LOBF <- function(data_x, data_y, labels = NULL) {
  data <- tibble(
    x = data_x,
    y = data_y
  )
  
  p <- ggplot(data,
              aes(x = x, y = y)) +
    geom_point() +
    geom_smooth(method = "lm",
                se = FALSE,
                color = "blue") +
    stat_cor(mapping=aes(label=..rr.label..),
             method="pearson",
             label.x=-Inf,
             label.y=Inf,
             hjust = -0.1,
             vjust = 1.1) +
    geom_text(aes(label = sprintf("Gradient: %.2f", coef(lm(y ~ x, data = data))[2])),
              x = -Inf,
              y= Inf,
              hjust = -0.05,
              vjust = 3.5)
  
  if(!is.null(labels) && length(labels > 1)) {
    p <- p + labs(title = paste(labels[1], "vs.", labels[2]),
                  x = labels[1],
                  y = labels[2])
  }
  
  return(p)
}

Heart Rate Tracking, Activity Tracking, and Step Counts

Feature Overview

  • Heart-rate Tracking: “Use it to track your workout progress and optimize personal training routines.”

  • Exercise Tracking: “Ivy will recognize your activity during the day, help you track up to 80 types of activity, count your steps, and discover how all that affects your body.”

  • Step Counts

  • Do people count their steps? Yeah, all but three of them did

  • Do people engage in different types of activity? Yeah, big spread of Fairly/Very Active intensity levels, safe to say people don’t all work out the same, so the more activity tracking the merrier

These three features are closely related, since one of the primary functions of the HR tracking is to detect exercise intensity; steps and distance tracking can also be used to further analyse users’ activity.

In order to better understand how people are using their FitBits, I analysed the amounts and intensity of different users exercise.

Analysis

How much time do people spend exercising?

I start by getting the top-level breakdown of users time vs intensity:

# For each ID, generate averages for the three non-sedentary intensity levels
mean_daily_intensities_wide <- intensity_sum_days_wide %>%
  group_by(id) %>%
  summarize("sedentary" = mean(minutes_sedentary) / 60,
            "lightly_active" = mean(minutes_lightly_active) / 60,
            "fairly_active" = mean(minutes_fairly_active) / 60,
            "very_active" = mean(minutes_very_active) / 60)

#Convert data to long-format for plotting as a histogram
intensity_order = c("sedentary", "lightly_active", "fairly_active", "very_active")
mean_daily_intensities_long <- mean_daily_intensities_wide %>%
  tidyr::gather(key = "intensity", value = "mean_hours", -id) %>%
  mutate(intensity = factor(intensity, levels = intensity_order))

# Reorder IDs in order of "Very Active" time
mean_daily_intensities_wide <- mean_daily_intensities_wide %>%
  mutate("total_active" = fairly_active + very_active) %>%
  arrange(total_active)

# Apply order of IDs to long-format data to force plot order
mean_daily_intensities_long$id <- factor(
  mean_daily_intensities_long$id,
  levels = mean_daily_intensities_wide$id)
viz_avg_time_by_intensity <- ggplot(mean_daily_intensities_long, aes(x= id, y= mean_hours, fill= intensity)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(breaks = seq(0, 24, by = 1)) +
  labs(title="Average Time by Intensity Zone",
       x = "User ID",
       y = "Total Average Daily Time (hours)")
print(viz_avg_time_by_intensity)

Preliminary Findings:

  • The large majority of people’s time is clearly being spent in a sedentary state.
  • This makes sense, since it includes sleep tracking as well as time spent awake but immobile
  • There is a wide spread of times spent non-sedentary, with no clear outliers at this point
  • The large majority of non-sedentary time is Lightly Active

Given how large the majority of non-Sedentary time is “Lightly Active”, and based on our understanding of the Intensity zones from previous sections, I will assume “Lightly Active” time includes any activity more intense than sitting down, up to an including activities like walking for leisure. Anything more intense would therefore fall into the “Fairly Active” or “Very Active” categories. With this in mind, I’ll proceed with the assumption that the “Fairly Active” and “Very Active” zones represent intentional exercise.

Next, I analyse users time spent intentionally exercising:

# Update total_active to reflect new definition
mean_daily_intensities_wide <- mean_daily_intensities_wide %>%
  mutate(total_active = fairly_active + very_active) %>%
  arrange(total_active)

mean_active_hours <- mean_daily_intensities_wide %>%
  select(id, total_active)

# Add cumulative user count for Pareto analysis
total_users <- nrow(mean_active_hours)
bin_width <- 0.25
rounding_factor <- 1 / bin_width
mean_active_hours <- mean_active_hours %>%
  mutate(total_active_rounded = round(total_active * rounding_factor) / rounding_factor) %>%
  mutate(cumulative_users = row_number() / total_users)

mean_hours <- mean(mean_active_hours$total_active)
median_hours <- median(mean_active_hours$total_active)

ymax <- 12
xmax <- 13.5
viz_active_hrs_per_week <- ggplot(mean_active_hours, aes(x=total_active)) +
  geom_histogram(binwidth = bin_width, col="white") +
  scale_x_continuous(breaks = seq(0, xmax, by = bin_width)) +
  geom_line(aes(x = total_active_rounded,
                y = cumulative_users * ymax),
            col="red") +
  geom_vline(aes(xintercept = mean_hours, col = 'red')) +
  geom_vline(aes(xintercept = median_hours), col = 'blue') +
  scale_y_continuous(name = "Number of Users",
                     breaks = seq(0, ymax, by = 1),
                     sec.axis = sec_axis(~./ymax,name = "Cumulative Percentage of Users")) +
  labs(title = "Active Hours per Week",
       x = "Hours")
viz_active_hrs_per_week <- plot_histo_pareto(mean_active_hours$total_active, bin_width = 0.25) +
  labs(title = "Active Hours per Week",
       x = "Hours")
print(viz_active_hrs_per_week)

Preliminary Findings:

  • 72.7% of the cohort (24 users) get less than 45 minutes of Active Time per week on average.
  • 18% of the cohort (6 users) do more than 1 hour a week on average.
  • Most people in the dataset are not doing much more than very light exercise.

How intensely do people exercise?

Here I analyse the data to see if there is any variation in the level of intensity of exercise between users. The variation was analysed by examining the ratio of “Fairly Active” to “Very Active” time for each user:

# For each ID, generate proportion Very Active time
mean_daily_intensities_wide <- mean_daily_intensities_wide %>%
  mutate(proportion_very_active = very_active / total_active)

# Convert data to long-format for plotting as a histogram
intensity_order = c("fairly_active", "very_active")
mean_daily_intensities_long <- mean_daily_intensities_wide %>%
  tidyr::gather(key = "intensity", value = "mean_hours", -id) %>%
  mutate(intensity = factor(intensity, levels = intensity_order)) %>%
  filter(intensity == "fairly_active" | intensity == "very_active")

# Reorder IDs in order of "Very Active" time
mean_daily_intensities_wide <- mean_daily_intensities_wide %>%
  arrange(proportion_very_active)

# Apply order of IDs to long-format data to force plot order
mean_daily_intensities_long$id <- factor(
  mean_daily_intensities_long$id,
  levels = mean_daily_intensities_wide$id)
viz_very_active_proportion <- plot_histo_pareto(mean_daily_intensities_wide$proportion_very_active,
                                                bin_width = 0.1) +
  labs(title = "Percentage of Active Time spent Very Active")
print(viz_very_active_proportion)

Preliminary Findings:

  • It’s quite evenly spread out from around 10% Very Active to around 90% Very Active across all users, with no clear outliers.
  • It would appear that the user group engages in a range of exercise activities of varying intensity.

Do people who track more do more-intense exercise?

The proportion of Active Time spent Very Active was plotted against the total Active Time to see if there was a correlation. From the earlier analysis of overall Active Hours per week, we can see that Active Hours are skewed right and concentrated towards the lower end: given this distribution, I also re-ran the analysis twice, once with the upper quintile of users removed, and once with the lower quintile removed. The results are plotted below.

plot_all_data <- ggplot(mean_daily_intensities_wide, aes(x=total_active, y=proportion_very_active)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE,
              color = "blue") +
  stat_cor(aes(label=..rr.label..),
           label.x=0,
           label.y=0) +
  labs(title="Proportion Very Active vs. Total Active Time",
       x = "Average Weekly Active Time (hours)",
       y = "Proportion of Very Active Time (%)")
data <- mean_daily_intensities_wide %>%
  filter(total_active <= 1.25) # id != "3977333714")

plot_no_upper_quint <- ggplot(data, aes(x=total_active, y=proportion_very_active)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE,
              color = "blue") +
  stat_cor(aes(label=..rr.label..),
           label.x=0,
           label.y=0) +
  labs(title="Proportion Very Active vs. Total Active Time (Highest Quintlie Removed)",
       x = "Average Weekly Active Time (hours)",
       y = "Proportion of Very Active Time (%)")

data <- mean_daily_intensities_wide %>%
  filter(total_active > 0.25) # id != "3977333714")

plot_no_lower_quint <- ggplot(data, aes(x=total_active, y=proportion_very_active)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE,
              color = "blue") +
  stat_cor(aes(label=..rr.label..),
           label.x=0,
           label.y=0) +
  labs(title="Proportion Very Active vs. Total Active Time (Lowest Quintile Removed)",
       x = "Average Weekly Active Time (hours)",
       y = "Proportion of Very Active Time (%)")

rm(data)

viz_very_active_vs_total_active <- plot_all_data + 
  scale_x_continuous(limits=c(0,2)) + 
  scale_y_continuous(limits=c(0,1))
print(viz_very_active_vs_total_active)

# plot_no_upper_quint <- plot_no_upper_quint + 
#   scale_x_continuous(limits=c(0,2)) + 
#   scale_y_continuous(limits=c(0,1))
# plot_no_lower_quint <- plot_no_lower_quint + 
#   scale_x_continuous(limits=c(0,2)) + 
#   scale_y_continuous(limits=c(0,1))
# 
# grid.arrange(viz_very_active_vs_total_active, plot_no_upper_quint, plot_no_lower_quint, ncol = 3)

Preliminary Findings: * Very Active Proportion was positively correlated with Active time, with an R^2 value of 0.27. * Removing the upper quintile decreased the correlation to 0.23, indicating that those doing less exercise also do a wider range of intensities * Removing the lower quintile increased the correlation to 0.32, indicating that those doing the most exercise are doing a narrower range of higher-intensity activities. * Overall, the results indicate that the majority of the cohort is engaged in a wide range of activities. There is a large subset of the group doing a low amount of varied exercise, as well as a small group of outliers doing a higher amount of more-intense exercise.

Do people prefer to walk or run?

Why do I care: If people like to run, we can market it at those people

How do we find out if people run? Steps/second Distance/second How will we know if they’re running? Speed above a certain level

I can use the src logs directly as they are logged every minute, so each data point is also an average speed in steps per minute. From this, I can set up some ranges for different walking types based on speed, count the number of steps logs in each range for each person/day, then finally average the sums over the days to get each users’ average time spent at each speed

brisk_walk_thld <- 80
running_thld <- 110

mean_movement_times_daily <- steps_src_mins_tall %>%
  mutate(activity_day = as.POSIXct(format(as.Date(activity_minute)))) %>%
  group_by(id, activity_day) %>%
  summarize(not_walking = sum(steps == 0) / 60,
            moderate_walking = sum(steps > 0 & steps <= brisk_walk_thld) / 60,
            brisk_walking = sum(steps > brisk_walk_thld & steps <= running_thld) / 60,
            running = sum(steps > running_thld) / 60)

mean_movement_times <- mean_movement_times_daily %>%
  group_by(id) %>%
  summarize(not_walking = mean(not_walking),
            moderate_walking = mean(moderate_walking),
            brisk_walking = mean(brisk_walking),
            running = mean(running)) %>%
  arrange(running)

speed_order = c("not_walking", "moderate_walking", "brisk_walking", "running")
mean_movement_times_long <- mean_movement_times %>%
  tidyr::gather(key = "speed", value = "mean_hours", -id) %>%
  mutate(speed = factor(speed, levels = speed_order))

mean_movement_times_long$id <- factor(
  mean_movement_times_long$id,
  levels = mean_movement_times$id
)
ggplot(mean_movement_times_long %>% filter(speed != "not_walking"),
       aes(x = id, y = mean_hours, fill = speed)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(breaks = seq(0,33,by=1)) +
  labs(title = "Average Daily Running Speeds by User ID",
       x = "User ID",
       y = "Average Daily Hours")

viz_mean_mod_walking <- plot_histo_pareto(mean_movement_times$moderate_walking,
                                      bin_width = 0.25) +
  labs(title = "Average Moderate Walking Hours per Week",
       x = "Average Hours per Week")
print(viz_mean_mod_walking)

viz_mean_brisk_walking <- plot_histo_pareto(mean_movement_times$brisk_walking,
                                        bin_width = 0.25) +
  labs(title = "Average Brisk Walking Hours per Week",
       x = "Average Hours per Week")
print(viz_mean_brisk_walking)

viz_mean_running <- plot_histo_pareto(mean_movement_times$running,
                                  bin_width = 0.25) +
  labs(title = "Average Running Hours per Week",
       x = "Average Hours per Week")
print(viz_mean_running)

grid.arrange(viz_mean_mod_walking,
             viz_mean_brisk_walking,
             viz_mean_running,
             ncol = 1)

Preliminary Findings:

  • Most people spend most of their time walking slowly
  • Brisk walking is somewhat more common, with 24 of 33 users logging between 0.25 and 0.5 hours per week
  • Runners are rare, with 30 of 33 users logging 0.25 hours or less per week, and the remaining three outliers logging between 0.5 and 1.25 hours per week

Do people increase their active walking hours over time?

viz_mean_mod_walking_vs_time <- get_time_coefficients_plot(mean_movement_times_daily$id,
                                                       mean_movement_times_daily$activity_day,
                                                       mean_movement_times_daily$moderate_walking) +
  labs(title = "Moderate Walking Over Time")

viz_mean_brisk_walking_vs_time <- get_time_coefficients_plot(mean_movement_times_daily$id,
                                                    mean_movement_times_daily$activity_day,
                                                    mean_movement_times_daily$brisk_walking) +
  labs(title = "Brisk Walking Over Time")

viz_mean_running_vs_time <- get_time_coefficients_plot(mean_movement_times_daily$id,
                                                    mean_movement_times_daily$activity_day,
                                                    mean_movement_times_daily$running) +
  labs(title = "Running Over Time")

grid.arrange(viz_mean_mod_walking_vs_time,
             viz_mean_brisk_walking_vs_time,
             viz_mean_running_vs_time,
             nrow = 3)

Preliminary Findings:

  • Most users actually decreased or did not change their walking time over the data period
  • This finding was consistent across all walking types
  • There is no evidence to suggest FitBit users increased their walknig times as a result of their FitBit device usage

Are people getting their steps in?

mean_daily_steps <- steps_sum_days_tall %>%
  group_by(id) %>%
  summarize(mean_steps = mean(steps_total))
viz_mean_daily_steps <- plot_histo_pareto(mean_daily_steps$mean_steps, 2000) +
  labs(title = "Average Daily Steps by User",
       x = "Average Daily Steps")
print(viz_mean_daily_steps)

Answer: Typically 7500.

Do people increase their step counts over time?

viz_mean_daily_steps_vs_time <- get_time_coefficients_plot(steps_sum_days_tall$id,
                                                steps_sum_days_tall$activity_day,
                                                steps_sum_days_tall$steps_total) +
  labs(title = "Correlation between Step Counts and Time")
print(viz_mean_daily_steps_vs_time)

Preliminary Findings:

  • As with walking types, the majority of users actually did fewer steps over time

Do people engage in stationary or mobile exercise?

Examples of stationary exercise include gym-based activities like treadmill running and weight-lifting. Examples of mobile exercise include outdoor activities like jogging, but also indoor activities like squash or, depending on the accuracy of the FitBit distance tracking, gym workouts based on rotation through various equipment and exercises spaced around the venue. Determining if users have a preference for one activity type over another can help guide the marketing for the Ivy.

I analysed stationary versus mobile activity preferences by plotting the data for Active Minutes against Active Distance. The theory was that the more stationary exercise users do, the weaker the correlation between active time and active distance should be, because users are not moving while exercising (i.e. machine-based cardio and resistance training).

distance_vs_active <- activity_sum_days_wide %>%
  select(id,
         activity_day,
         distance_moderately_active,
         distance_very_active,
         minutes_fairly_active,
         minutes_very_active) %>%
  mutate(distance_active = distance_moderately_active + distance_very_active,
         minutes_active = minutes_fairly_active + minutes_very_active) %>%
  group_by(id) %>%
  summarize(mean_distance_active = mean(distance_active),
            mean_minutes_active = mean(minutes_active))
viz_distance_vs_activity <- scatter_with_LOBF(distance_vs_active$mean_minutes_active,
                       distance_vs_active$mean_distance_active,
                       c("Mean Daily Active Minutes", "Mean Daily Active Distance")) +
  labs(caption = "\"Active\" refers to time spent either Fairly or Very Active") +
  scale_y_continuous(breaks = seq(0,10,by=1))
print(viz_distance_vs_activity)

Preliminary Findings:

  • Active Minutes correlate strongly with Active distance, with R^2 = 0.77.
  • Clearly absent from the plot is a significant number of users with high Active Minutes but low Active Distance
  • Users typically log an additional 3.6 km of distance for every hour of Active time: to achieve this in a gym setting a user would have to follow a typical pattern resembling, for example, two one-minute stationary resistance training sets, followed by one minute moving around the gym at approximately 10.8 kilometers per hour, which seems unrealistic

Overall, based on the strong correlation between Active Time and Active distance, and the scale of the distances covered by active users, it would appear that the cohort engages in a variety of mobile exercise activities. The exact breakdown of mobile vs. stationary activities has not been determined, but I believe it is safe to say that the marketing for the Ivy should represent mobile and outdoor activities more strongly over indoor, gym-based, or other stationary activities.

Readiness Score

Feature Overview

Subtitle: “Includes Resting Heart Rate, Respiratory Rate, and Cardiac Coherence functions”

Description:

  • “While you sleep at night and your body is in its calmest state, Ivy measures your resting heart rate, respiratory rate, and cardiac coherence.”
  • “Resting heart rate (RHR) is the number of heartbeats per minute, measured when the body is fully calm during the night.”
  • “The cardiac coherence parameter shows how your heart rate variability and breathing rate are synchronized.”

The Readiness Score is one of several features of the Ivy that can be marketed as a value-add over the FitBit devices in the data set. The feature makes use of several functions of the device, some of which the FitBits in the data set do not have, and builds on other features already available, like Respiratory Rate and Resting Heart Rate

The FitBit devices in the dataset don’t log any direct equivalent to the Readiness Score, nor does the documentation for the devices describe any comparable feature.

While the Blaze does have a breathing-detection function, it does not appear to be tracked over time in the data, and the Blaze manual describes it used only for “Guided Breathing” activities, which track respiratory rate only during the activity.

The Surge manual also mentions that (“Wearing your Surge while you sleep has the added benefit of making the Resting Heart Rate measurements on your dashboard more accurate”)[https://myhelp.fitbit.com/resource/manual_surge_en_US], so it appears to have that feature as well, but it is not built into an overall “Readiness Score”

This makes a potentially valuable function of the device, but also makes it difficult to investigate the existing data for evidence of how users might make use of this feature.

One aspect we can analyse is whether or not users currently wear their devices to bed.

Analysis

Do people sleep with their FitBits on?

Nighttime usage of FitBits is particularly relevant to the Readiness Score, as that function does not work unless the user wears their device to bed. If people are already frequently wearing their devices to bed, then the marketing can leverage this tendency to push the feature as an improvement on existing features. On the other hand, if users are not wearing their devices overnight, this may indicate that users are not interested in existing overnight-monitoring functions, or that there are other factors that make overnight wear unappealing, which marketing teams will need to take into account.

Which models can tell when the user is sleeping?

As of April 2016, the following wristband FitBits were available:

Model Sleep Logging Heart-rate Tracking Source
Alta Yes No User Manual
Blaze Yes Yes User Manual
Charge Yes No User Manual
Charge HR Yes Yes User Manual
Flex 1 Yes No User Manual
Flex 2 Yes No User Manual
Surge Yes Yes User Manual

Preliminary Findings:

  • Sleep logging is available on all devices, so sleep log data should indicate how the overall cohort uses this feature
  • Heart-rate logging is limited to later devices, so heartrate data will only indicate how that subset of the cohort uses that feature
Sleep Logs Analysis

For each user, I calculated the number of nights where the total sleep time logged was greater than 6 hours: this value was picked as being close enough to the recommended 8 hours of sleep to indicate the user did actually go to bed with their device on, but not so high that it would not pick up data from users who don’t get a full night’s sleep.

hours_asleep <- 6
total_nights <- 33
nights_logged_by_sleep_log <- sleep_sum_days_wide %>%
  group_by(id) %>%
  summarize(count_logged_asleep = sum(minutes_asleep_total / 60 > hours_asleep),
            count_logged_in_bed = sum(minutes_in_bed_total / 60 > hours_asleep),
            pct_logged_asleep = count_logged_asleep / total_nights,
            pct_logged_in_bed = count_logged_in_bed / total_nights)

plot_histo_pareto(nights_logged_by_sleep_log$pct_logged_in_bed, bin_width = 0.05) +
  labs(title = "Percentage of Nights Logged In Bed - Logged Users Only",
       subtitle = "Source: Sleep Log Data")

# Add missing ID values to the dataset to analyse the full cohort
missing_ids <- setdiff(unique(activity_sum_days_wide$id), nights_logged_by_sleep_log$id)
for (id in missing_ids) {
  nights_logged_by_sleep_log <- nights_logged_by_sleep_log %>%
    rbind(.,data.frame(
          id = id,
          count_logged_asleep = 0,
          count_logged_in_bed = 0,
          pct_logged_asleep = 0.0,
          pct_logged_in_bed = 0.0))
}
plot_histo_pareto(nights_logged_by_sleep_log$pct_logged_in_bed, bin_width = 0.05) +
  labs(title = "Percentage of Nights Logged In Bed - All Users",
       subtitle = "Source: Sleep Log Data")

Preliminary Findings:

  • Out of all users in the data set, very few users are logging, with 36% (12 users) logging between 0% and 5% of nights, and a majority of 51% (17 users) logging 10% of nights or fewer
  • Out of users with sleep log data, the results are more evenly spread, with 37% (9 users) logging 0% to 15% of nights, and another 42% (10 users) logging 70% to 90% of nights.

Given that the FitBit devices available at the time the dataset was recorded all have sleep-tracking functionality, these findings indicate that the majority of users are not making use of this function. One possible reason for this is indicated by the manual for the FitBit Surge, which informs the user that “wearing your tracker 24/7 does not allow your skin to breathe”

Heart-rate Analysis

The sleep logs data gives a good indication of whether users are interested in sleep logging data, but it doesn’t directly confirm whether people are wearing their devices to bed. To investigate this, I also analysed the heart-rate data for each user. This is by far the largest and least-processed data set, but it has the advantage of only generating data if the device is physically contacting the user.

Taking the brute-force approach to plotting the data for each user makes it much easier to visualise the patterns in user behaviour:

TODO: Widen out this plot

data <- heartrate_src_seconds_tall %>%
    mutate(is_nighttime = hour(heart_rate_second) >= 22 | hour(heart_rate_second) < 6)
p <- ggplot(data) +
  geom_point(aes(x = heart_rate_second,
                 y = heart_rate,
                 color = is_nighttime),
             size = 0.1,
             shape = 3) +
  facet_wrap(vars(id))
print(p)


ggsave("viz_all_heartrate_data.png",
       plot = p,
       device = "png",
       width = 3840,
       height = 2160,
       units = "px")

rm(data)

Once R is done chewing through the data, we can see that all users have clear gaps in their heart-rate logs.

The heart rate data takes the form of individual polls recording the user’s momentary heart rate. This presents a challenge compared to using the summary data, in that these individual logs must be processed to determine when the user was wearing their device continuously. Counting the number of logs over a given time period will not work, as the time between logs varies even when the device is worn. Polling times less than 60 seconds, for example, are typically between 1 and 20 seconds, as shown below:

heartrate_polling_variance <- heartrate_src_seconds_tall %>%
  group_by(id) %>%
  mutate(polling_diff = as.double(difftime(heart_rate_second, lag(heart_rate_second), units = "secs"))) %>%
  filter(polling_diff < 60)
ggplot(heartrate_polling_variance) +
  geom_boxplot(aes(y = polling_diff)) +
  scale_y_continuous(breaks = seq(0,60,by=5))

This means that even choosing the median value of 5 seconds could result in an inferred “time logged” value that’s out by a factor of 5.

To build out the data, I decided on an approach where the time between logs was used to infer the time logged, but only within an acceptable time difference. If one log follows its predecessor by less than this difference, the time between the two is considered “Logged” time: if the delay is greater, we assume the device was taken off for that time.

From the polling variance analysis boxplot, we saw that the median, 75th-percentile, and Maximum values were 5 seconds, 10 seconds, and 17 seconds, respectively. Given this range, I selected 20 seconds as the maximum acceptable time between polls.

# Find nights where enough heart-rate data was logged to imply the user slept with their FitBit on

sec2hour <- 1 / 3600
# Sleep range set to between 10pm and 6am
sleep_range_stt_hour <- 22
sleep_range_end_hour <- 6
# Anything more than this time between logs is considered a removal (typical poll rate is 1-20 seconds)
max_poll_gap_secs <- 20
# Minimum six hours must be logged to be counted
min_hours_logged <- 6
nights_total <- 33

nights_logged_by_heart_rate <- heartrate_src_seconds_tall %>%
  # For logs in the AM, the night-of is set to the day before
  mutate(night_of_yday = ifelse(hour(heart_rate_second) >= sleep_range_stt_hour, yday(heart_rate_second),
                                ifelse(hour(heart_rate_second) < sleep_range_end_hour, yday(heart_rate_second) - 1,
                                       NA))) %>%
  # Only consider logs from the "sleep" range
  filter(!is.na(night_of_yday)) %>%
  # Group by id so that difftime only compares timestamps from the same user
  group_by(id) %>%
  mutate(time_since_last_poll = as.double(difftime(heart_rate_second, lag(heart_rate_second), units = "secs"))) %>%
  # Remove any logs that are followed by too long a gap
  filter(time_since_last_poll <= max_poll_gap_secs) %>%
  group_by(id, night_of_yday) %>%
  # Find the total time logged as the sum of time between valid logs
  summarize(logged_time_hours = sum(time_since_last_poll) * sec2hour) %>%
  # Count up the nights where each user logged sufficient time to be considered asleep with their fitbit
  group_by(id) %>%
  summarize(nights_logged = sum(logged_time_hours > min_hours_logged),
            nights_logged_pct = nights_logged / nights_total)
# Plot histogram showing who logged what fraction of their nights
plot_histo_pareto(nights_logged_by_heart_rate$nights_logged_pct, bin_width = 0.05) +
  labs(title = "Percentage of Nights Logged In Bed - HR Users Only",
       subtitle = "Source: Heart Rate Data")

# 
# # Add missing ID values to the dataset to analyse the full cohort
# missing_ids <- setdiff(unique(activity_sum_days_wide$id), nights_logged_by_heart_rate$id)
# for (id in missing_ids) {
#   nights_logged_by_heart_rate <- nights_logged_by_heart_rate %>%
#     rbind(.,data.frame(
#           id = id,
#           nights_logged = 0,
#           nights_logged_pct = 0.0))
# }
# 
# # Plot histogram showing who logged what fraction of their nights
# plot_histo_pareto(nights_logged_by_heart_rate$nights_logged_pct, bin_width = 0.05) +
#   labs(title = "Percentage of Nights Logged In Bed - All Users",
#        subtitle = "Source: Heart Rate Data")

Given the small number of heart-rate users, I also compared the the heart-rate results directly to the sleep log results to see if there was any difference in the nights logged for individual IDs:

sleep_data_merged <- merge(nights_logged_by_sleep_log, nights_logged_by_heart_rate, by = "id", all = TRUE) %>%
  select(id, pct_logged_asleep, nights_logged_pct) %>%
  rename(sleep_logs = pct_logged_asleep,
         heartrate_logs = nights_logged_pct)
sleep_data_merged <- tidyr::gather(sleep_data_merged, key = "Variable", value = "Value", -id)

ggplot(sleep_data_merged, aes(x = id, y = Value, fill = Variable)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7, color = "black") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1)) +
  labs(title = "Nights Logged: Sleep Logs vs. Heartrate Logs",
       x = "ID",
       y = "% of Nights Logged",
       fill = "Source")

For those IDs present in both data sets, the results were almost identical, with only three users showing variation equating to 2-3 nights, indicating the method is valid.

Preliminary Findings:

  • Out of all users, 73% (24 users) did not log any heartrate-based sleep logs.
  • Out of all users, 42% (14 users) logged heart-rate data
  • Out of those users, the results were split, with 50% (7 users) logging sleep on 35% to 90% of nights, and the other 50% logging sleep on 0% to 15% of nights
  • The heart-rate-based analysis produces the same results for individual users as the sleep-log-based analysis
  • Overall, the results reinforce the findings of the sleep-log analysis, but only for the smaller sub-set of users that have devices with heart-rate tracking.

Sleep Tracking

Feature Overview

Analysis

How many people used the sleep tracking feature?

From the previous analysis for the Readiness Score feature, we know that sleep tracking is not commonly used, with a majority of 51% (17 users) logging 10% of nights or fewer, and the remainder spread out over a range up to 90% of nights.

What quantity of sleep are users getting?

  • Sleep is ranked as Awake, Restless, or Asleep
  • Quantity and quantity can be tracked as time spent Asleep as a percentage of total logged
  • Quality can also be tracked as
sleep_sum_quant_qual <- sleep_src_mins_tall %>%
  # Set the date for sleep logs in the AM to the day before, i.e. the Night Of [date]
  mutate(sleep_date = as.Date(ifelse(hour(sleep_minute) > 12,
                                     as.Date(sleep_minute),
                                     as.Date(sleep_minute) - 1))) %>%
  # mutate(sleep_date = as.Date(sleep_date)) %>%
  group_by(id, sleep_date) %>%
  summarize(minutes_awake    = sum(sleep_rank == "Awake"),
            minutes_restless = sum(sleep_rank == "Restless"),
            minutes_asleep   = sum(sleep_rank == "Asleep"),
            minutes_total    = minutes_awake + minutes_restless + minutes_asleep,
            pct_awake        = minutes_awake / minutes_total,
            pct_restless     = minutes_restless / minutes_total,
            pct_asleep       = minutes_asleep / minutes_total)

sleep_sum_quant_qual_mean <- sleep_sum_quant_qual %>%
  group_by(id) %>%
  summarize(across(.cols = -sleep_date, \(x) mean(x, na.rm = TRUE), .names = "{.col}_mean")) %>%
  mutate(hours_total_mean = minutes_total_mean / 60)
plot_histo_pareto(sleep_sum_quant_qual_mean$hours_total_mean, bin_width = 1) +
  labs(title = "Average Hours Asleep",
       x = "Hours Asleep")

Preliminary Findings:

  • The results appear realistic: the mean and median are 6.9 and 7.4 hours, respectively, with 79% of users (19 users) getting between 6 and 9 hours of sleep on average

What quality of sleep are users getting?

plot_histo_pareto(sleep_sum_quant_qual_mean$pct_awake_mean, bin_width = 0.05) +
  labs(title = "Percentage of Sleep Logs Spent Awake",
       x= "% of Time")

plot_histo_pareto(sleep_sum_quant_qual_mean$pct_restless_mean, bin_width = 0.05) +
  labs(title = "Percentage of Sleep Logs Spent Restless",
       x= "% of Time")

plot_histo_pareto(sleep_sum_quant_qual_mean$pct_asleep_mean, bin_width = 0.05) +
  labs(title = "Percentage of Sleep Logs Spent Asleep",
       x= "% of Time")

Preliminary Findings:

  • The majority of users appear to be getting high-quality sleep, with 88% of users logging between 90% and 95% of time as Asleep
  • Time spent Awake is almost nil, with 88% of users logging less than 5% of time asleep. This is likely due to the programming of the FitBit devices, which feature an automatic sleep logging function that kicks in when it detects the user may be asleep, based on movement and heart rate data

Did sleep quantity or quality improve over time?

# get_time_coefficients_plot <- function(ids, timestamps, values)
ggplot(sleep_sum_quant_qual,
       aes(x = sleep_date,
           y = pct_asleep)) +
  geom_point() +
  geom_smooth(method = "lm",
                se = FALSE,
                color = "blue") +
    stat_cor(mapping=aes(label=..rr.label..),
             method="pearson",
             label.x=-Inf,
             label.y=Inf,
             hjust = -0.1,
             vjust = 1.1) +
  facet_wrap(vars(id))
# Brute-force Preliminary Findings: Most lines are flat, with some (5-6) that are slightly down

I analysed sleep quantity and quality over time by calculating the coefficients with time of “total hours asleep” and “percentage of time awake”, respectively:

get_time_coefficients_plot(sleep_sum_quant_qual$id,
                         sleep_sum_quant_qual$sleep_date,
                         sleep_sum_quant_qual$minutes_asleep) +
  labs(title = "Coefficient of Time: Total Time Asleep")

get_time_coefficients_plot(sleep_sum_quant_qual$id,
                           sleep_sum_quant_qual$sleep_date,
                           sleep_sum_quant_qual$pct_asleep) +
  labs(title = "Coefficient of Time: % of Time Asleep")

Preliminary Findings:

  • The correlations for sleep quantity and quality formed approximately-normal distributions around zero, indicating no consistent increase for either variable
  • Sleep quantity appeared to decrease slightly, with mean and median correlations of -0.09 and -0.13, respectively
  • Sleep quantity was approximately static, with mean and median correlations of -0.04 and -0.02, respectively
  • Overall, there is nothing to indicate that users increased the quantity or quality of their sleep over time
  • The quantity and quality of sleep for this cohort of users was already relatively good, with the majority of users getting between 6-9 hours sleep, and spending upwards of 90% of that time in a stable Asleep state.

Resting Heart Rate and Cardiac Coherence

Feature Overview

Although the FitBits in the data set do have heart rate and breathing rate tracking, there is no data for resting heart rate or cardiac coherence specifically. What I can analyse is the data on heart-rate tracking generally, specifically:

  • How often do people track their heart-rates, either during the day or during the night?
  • When people do track their heart-rates, what level of activity are they engaged in? I.e. are they only putting their trackers on to exercise, or do they wear them during lower-activity or sedentary periods?

This will help me understand whether or not the RHR and CC functions will appeal to the user base: for instance, users who rarely wear their existing heart-rate monitors, or only wear them during intense exercise, may be less likely to care about their longer-term cardiac performance.

Analysis

How often do people track their heart-rate?

Nighttime Usage

From my previous analysis on heart-rate data, we know that overnight heart-rate tracking is split, with 50% (7 users) logging sleep on 35% to 90% of nights, and the other 50% logging sleep on 0% to 15% of nights.

Daytime Usage

I can adapt the code from the nighttime analysis to investigate daytime usage:

# Find nights where enough heart-rate data was logged to imply the user slept with their FitBit on

sec2hour <- 1 / 3600
# Sleep range set to between 10pm and 6am
period_stt_hour <- 6
period_end_hour <- 22
# Typical poll rate is 1-20 seconds: increased to 60 to allow for device moving out of position during daytime movement
max_poll_gap_secs <- 60
# Minimum six hours must be logged to be counted
min_hours_logged <- 6
periods_total <- 33
logging_days <- TRUE

if (logging_days) {
  periods_logged <- heartrate_src_seconds_tall %>%
    mutate(yday = ifelse((hour(heart_rate_second) >= period_stt_hour & hour(heart_rate_second) < period_end_hour),
                         yday(heart_rate_second),
                         NA))
} else {
  # For night logging, set yday to the day before, i.e. "Night of [yday]"
  periods_logged <- heartrate_src_seconds_tall %>%
    mutate(yday = ifelse(hour(heart_rate_second) >= period_stt_hour,
                         yday(heart_rate_second),
                         ifelse(hour(heart_rate_second) < period_end_hour,
                                yday(heart_rate_second) - 1,
                                NA)))
}

periods_logged <- periods_logged %>%
  # Only consider logs within the period of interest
  filter(!is.na(yday)) %>%
  # Group by id so that difftime only compares timestamps from the same user
  group_by(id) %>%
  mutate(time_since_last_poll = as.double(difftime(heart_rate_second, lag(heart_rate_second), units = "secs"))) %>%
  # Remove any logs that are followed by too long a gap
  filter(time_since_last_poll <= max_poll_gap_secs) %>%
  group_by(id, yday) %>%
  # Find the total time logged as the sum of time between valid logs
  summarize(logged_time_hours = sum(time_since_last_poll) * sec2hour) %>%
  # Count up the nights where each user logged sufficient time to be considered asleep with their fitbit
  group_by(id) %>%
  summarize(periods = sum(logged_time_hours > min_hours_logged),
            periods_pct = periods / periods_total)
# Plot histogram showing who logged what fraction of their nights
plot_histo_pareto(periods_logged$periods_pct, bin_width = 0.05) +
  labs(title = "Percentage of Periods Logged",
       subtitle = "Source: Heart Rate Data")

Preliminary Findings:

  • Daytime logging is much more popular than nighttime logging
  • 50% of users (7 users) logged 6 hours on between 80% to 95% of days
  • The other 50% of users logged on between 0% and 75% of days

What levels of activity do people typically track?

First the naive approach: What percentage of various users’ time is spent in what heart-rate zone?

# From the Charge HR manual:
#   Sedentary = 0-50% Max HR
#   Lightly Active = 50-70% Max HR
#   Fairly Active = 70-85% Max HR
#   Very Active = 85-100% Max HR
# Max HR = 220 - age
# Arbitrarily set median age at 30 for initial analysis
assumed_median_age <- 30
max_heart_rate <- 220 - assumed_median_age
hr_zone_1 <- 0.5 * max_heart_rate
hr_zone_2 <- 0.7 * max_heart_rate
hr_zone_3 <- 0.85 * max_heart_rate

hr_zone_data <- heartrate_src_seconds_tall %>%
  mutate(hr_zone = ifelse(heart_rate >= 0 & heart_rate < hr_zone_1, "sedentary",
                   ifelse(heart_rate >= hr_zone_1 & heart_rate < hr_zone_2, "lightly_active",
                   ifelse(heart_rate >= hr_zone_2 & heart_rate < hr_zone_3, "fairly_active",
                   ifelse(heart_rate >= hr_zone_3, "very_active", NA))))) %>%
  group_by(id) %>%
  summarize(sedentary = sum(hr_zone == "sedentary"),
            lightly_active = sum(hr_zone == "lightly_active"),
            fairly_active = sum(hr_zone == "fairly_active"),
            very_active = sum(hr_zone == "very_active"))

#Convert data to long-format for plotting as a histogram
hr_zone_order = c("sedentary", "lightly_active", "fairly_active", "very_active")
hr_zone_data_long <- hr_zone_data %>%
  tidyr::gather(key = "hr_zone", value = "count", -id) %>%
  mutate(hr_zone = factor(hr_zone, levels = hr_zone_order))
ggplot(hr_zone_data_long, aes(x= id, y= count, fill= hr_zone)) +
  geom_bar(stat = "identity",
           position = "fill") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title="Count of HR Logs by HR Zone",
       x = "User ID",
       y = "Count")

Preliminary Findings:

  • The majority of logs for all but one user are in the “Sedentary” category, which equates to less than 95 bpm
  • It is fair to assume therefore that users are not only using their heart-rate trackers for exercise
  • There should be ample data to determine a user’s resting heart rate from the daytime data, even if there is little nighttime data

Features without Data

Feature Overview

Several functions and features of the Ivy are not present on the FitBits in the data set. Given the lack of data, it’s not possible to directly analyse existing user behaviours with respect to these features, which include:

  • Respiratory Rate
  • Hydration tracking
  • Mindfulness Minute Tracking
  • Menstrual Cycle
  • Wellness Score

TODO Step 5: Share

In which I present all of my key findings and their supporting visualisations

Most users don’t get much exercise

Visualising how much time individual users spent in each zone shows a clear bias towards sedentary and lightly-active zones:

Visualising the frequency of Active Hours across all users shows a right-skewed distribution, with most users getting little exercise and a small number of outliers logging higher numbers:

For those who do exercise, intensity varied widely

Users get around 7500 steps in per day on average, but this is varied

Walking is much more common than running:

Users do not increase their step count, walking time, or walking intensity over time

Users prefer to engage in mobile exercise

Sleep logging was not common

Among those who did log, usage was spread out

Heart-rate data indicates overnight wear is split between low- and high-frequency users Device comfort may be a factor

TODO Step 6: Act

In which I summarise my recommendations

Recommendations: Heart Rate Tracking, Activity Tracking, and Step Counts

  • To appeal to the FitBit market, the marketing for the Ivy’s exercise tracking should highlight a broad range of activities ranging from gentle to more intense. The marketing should not target highly-athletic groups, as these represent only a small part of the user base.
  • The marketing for the Ivy should avoid any claims that users will increase their activity levels or their step counts as a result of their purchasing the device, as the data does not support this claim.
  • The cohort does appear to engage in outdoor activities that cover long distances such as running, as opposed to stationary activities such as gym workouts, so the marketing can highlight activities like light jogging, walking dogs, and cycling

Recommendations: Readiness Score

  • Overnight usage of FitBits appears to be split between a large majority who rarely wear their devices overnight, and a smaller minority who wear more frequently, up to 90% of the time
  • The Readiness Score can therefore be marketed as a value-add feature that produces more useful information when worn to bed than competing products
  • The non-allergenic materials used in the Ivy should be used to highlight the comfort aspects of the device when marketing features that require overnight wearing, as discomfort from overnight wear may be one of the factors contributing to the low utilisation of sleep logging in existing products.

Recommendations: Sleep Tracking

  • Most users are not logging sleep

  • Those that are appear to already be getting high-quality sleep

  • There is nothing to indicate the use of the FitBits improved the quantity or quality of their sleep above this high baseline

  • Don’t market this feature as something that will improve your sleep

  • Instead, given the consistency of users in getting their sleep, as evidenced by the small correlation with time for sleep quantity and quality, the Sleep Tracking could be marketed more as a way of tracking and reinforcing the user’s existing good sleep habits

  • As with the Readiness Score, marketing teams could frame the feature more as a way of getting a heads-up on days where you haven’t slept well, since most of the time the feature would just tell users (from this cohort, at least) that they just had a reasonably good night’s sleep. This would work well with the “track existing habits” marketing angle: think “You’re great at giving your body the sleep it needs, but for those nights where things just don’t work out, Ivy can give you guidance on whether you’re up for the day, or better off taking it down a notch to recover.”

  • If it exists, highlight a manual-activation mode for the Sleep Tracking mode. The low amount of non-Asleep data in this data set may be a result of people mostly using the automatic mode, which is convenient, but doesn’t track time spent trying to get to sleep: a manual-activation feature could therefore be pushed harder as a sort of second feature that gets better data on how you go getting to sleep.

Recommendations: RHR and CC

  • In order to match current user behaviours, the marketing for the Resting Heart Rate should emphasise daytime usage: for instance, advertise that the feature will help understand your resting heart rate during your work day,
  • The data indicates that users typically do not log overnight, which may impact the performance of this feature: part of the usefulness of the feature comes from calibrating the data based on overnight heart-rate, so if users are not wearing overnight, the resting heart rate value may be skewed upward s as it is based primarily on daytime usage
  • The marketing should therefore encourage users to wear overnight, highlighting the synergy between this and the Readiness Score feature to maximise the likelihood that users wear overnight, and therefore the accuracy of both features
  • If users do not wear their devices overnight, they are unlikely to see the benefits and may consider other, less-featured but cheaper devices for their next purchase

Recommendations: Features without Data

  • These features can all be marketed as improvements and additions over the functionality of existing FitBits.
  • As discussed with regards to the Readiness score, marketing for any feature that requires overnight wear, like the Cardiac Coherence tracking, should take into account the low utilisation of existing overnight functions like sleep tracking seen in the existing data.

TODO Appendixes

Appendix A: Code Snippets

Clean and update variable names

# Function Declarations ----

rename_df_variables <- function(df_name, var_mods) {
  cat("Renaming variables in \"",df_name,"\"...\n", sep = "")
  df <- get(df_name)
  # Check each var name requiring correction against the var names in the df
  for (i in 1:nrow(var_mods)) {
    var_old = var_mods$var_old[i]
    if (!(var_old %in% colnames(df))) {
      next
    }
    # If found, make sure the conversion is applicable to this or all dfs
    tbl <- var_mods$tbl[i]
    if (tbl != "" && tbl != df_name) {
      next
    }
    # Perform the conversion if all checks passed
    var_new = var_mods$var_new[i]
    cat("df: ",df_name, "\tvar_old: ",var_old,"\t",sep="")
    cat("var_new: ",var_new,"\t", sep="")
    cat("tbl: ",tbl,"    ", sep="")
    cat("Replacing... ", sep = "")
    df <- df %>% rename(!!var_new := !!var_old)
    cat("Done.\n", sep = "")
  }
  rm(i)
  cat("Renaming variables in \"",df_name,"\" complete.\n", sep = "")
  return(df)
}

# Global Variable Declarations ----

var_mods <- data.frame(
  var_old = character(0),
  var_new  = character(0),
  type_new = character(0),
  tbl = character(0)
)

# WARNING: Ensure table-specific modifications (tbl != "") are positioned above non-specific modifications with matching var_old/var_new values: only the first modification in the list will be applied to matching variables.
# TODO: Eliminate this issue by modifying code to warn/handle conflicting rows

var_mods <- var_mods %>%
  rbind(.,data.frame(var_old="date",                       var_new="bodycomp_datetime",          type_new="POSIXct", tbl="bodycomp_src_logs_wide")) %>%
  rbind(.,data.frame(var_old="time",                       var_new="heart_rate_second",          type_new="POSIXct", tbl="heartrate_src_seconds_tall")) %>%
  rbind(.,data.frame(var_old="value",                      var_new="heart_rate",                 type_new="",        tbl="heartrate_src_seconds_tall")) %>%
  rbind(.,data.frame(var_old="date",                       var_new="sleep_minute",               type_new="POSIXct", tbl="sleep_src_mins_tall")) %>%
  rbind(.,data.frame(var_old="value",                      var_new="sleep_rank",                 type_new="",        tbl="sleep_src_mins_tall")) %>%
  rbind(.,data.frame(var_old="",                           var_new="activity_hour",              type_new="POSIXct", tbl="")) %>%
  rbind(.,data.frame(var_old="",                           var_new="activity_minute",            type_new="POSIXct", tbl="")) %>%
  rbind(.,data.frame(var_old="",                           var_new="sleep_day",                  type_new="POSIXct", tbl="")) %>%
  rbind(.,data.frame(var_old="",                           var_new="id",                         type_new="character", tbl="")) %>%
  rbind(.,data.frame(var_old="log_id",                     var_new="sleep_log_id",               type_new="character", tbl="sleep_src_mins_tall")) %>%
  rbind(.,data.frame(var_old="log_id",                     var_new="bodycomp_log_id",            type_new="character", tbl="bodycomp_src_logs_wide")) %>%
  rbind(.,data.frame(var_old="activity_date",              var_new="activity_day",               type_new="Date",   tbl="")) %>%
  rbind(.,data.frame(var_old="fairly_active_distance",     var_new="distance_fairly_active",     type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="fairly_active_minutes",      var_new="minutes_fairly_active",      type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="light_active_distance",      var_new="distance_lightly_active",    type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="light_active_minutes",       var_new="minutes_lightly_active",     type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="lightly_active_distance",    var_new="distance_lightly_active",    type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="lightly_active_minutes",     var_new="minutes_lightly_active",     type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="logged_activities_distance", var_new="distance_logged_activities", type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="me_ts",                      var_new="mets",                       type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="moderately_active_distance", var_new="distance_moderately_active", type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="sedentary_active_distance",  var_new="distance_sedentary",         type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="sedentary_distance",         var_new="distance_sedentary",         type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="sedentary_active_minutes",   var_new="minutes_sedentary",          type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="sedentary_minutes",          var_new="minutes_sedentary",          type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="step_total",                 var_new="steps_total",                type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="total_distance",             var_new="distance_total",             type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="total_intensity",            var_new="intensity_total",            type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="total_minutes_asleep",       var_new="minutes_asleep_total",       type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="total_sleep_records",        var_new="sleep_records_total",        type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="total_steps",                var_new="steps_total",                type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="total_time_in_bed",          var_new="minutes_in_bed_total",       type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="tracker_distance",           var_new="distance_tracker",           type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="very_active_distance",       var_new="distance_very_active",       type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="very_active_minutes",        var_new="minutes_very_active",        type_new="",        tbl=""))